For the purposes of this tutorial, I am using R's Airquality dataset. (Attach Link).
Let's do a quick import to get started.
In [1]:
import pandas as pd
import badfish as bf
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
df = pd.read_csv('data/airquality.csv', index_col=0)
We need to convert the pandas DataFrame to Badfish's MissFrame.
In [3]:
mf = bf.MissFrame(df)
A MissFrame converts your data to a boolean matrix in which a missing cell is True and a filled cell is False.
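To make the boolean-matrix idea concrete, here is a minimal sketch using plain pandas rather than Badfish itself. The toy frame and its values are made up for illustration. `DataFrame.isnull()` yields the same kind of True/False mask described above, though MissFrame's internal representation may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the airquality dataset.
toy = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0],
    "Wind": [7.4, 8.0, np.nan],
})

# True marks a missing cell, False marks a filled cell.
mask = toy.isnull()
print(mask)
```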
What are the different functions that can be used with the MissFrame?
In [4]:
dir(mf)
Out[4]:
Let's quickly use pandas' isnull().sum() to check how many missing values are present in each column.
In [5]:
df.isnull().sum()
Out[5]:
All MissFrame methods share the same argument structure.
We can replicate this count of missing values with mf.counts():
In [6]:
mf.counts()
Out[6]:
Now let's make our query a tad more complicated.
What if I wanted to see how many cells of Solar.R, Wind and Temp are missing when Ozone is missing? This gives an idea of how missing data in one (or more) columns relates to missing data in other columns.
In [7]:
mf.counts(where=['Ozone'], how='any', columns=['Solar.R', 'Wind', 'Temp'])
Out[7]:
Okay, so when Ozone goes missing we've got 8 missing cells of Temp and 2 each of Wind and Solar.R.
What happens when Ozone OR Temp goes missing? How does it affect the other columns?
In [8]:
mf.counts(where=['Ozone','Temp'], how='any', columns=['Solar.R','Wind','Temp'])
Out[8]:
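The how argument controls how the where columns are combined: how='any' selects rows where at least one of them is missing (OR), while how='all' selects rows where every one of them is missing (AND).
If you want to see the missing-cell counts in rows where Ozone AND Temp are both missing: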
In [9]:
mf.counts(where=['Ozone', 'Temp'], how='all', columns=['Solar.R', 'Wind', 'Temp'])
Out[9]:
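The where/how filtering used in the counts() calls above can be emulated in plain pandas. This is a sketch of the semantics as described in this tutorial, not Badfish's actual implementation. The helper name `missing_counts` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_counts(df, where=None, how="any", columns=None):
    """Count missing cells per column, restricted to rows where the
    `where` columns are missing (an emulation, not Badfish's code)."""
    mask = df.isnull()
    if where:
        if how == "any":
            rows = mask[where].any(axis=1)   # OR across the where columns
        elif how == "all":
            rows = mask[where].all(axis=1)   # AND across the where columns
        else:
            raise ValueError("how must be 'any' or 'all'")
        mask = mask[rows]
    if columns:
        mask = mask[columns]
    return mask.sum()

# Hypothetical toy data, not the airquality dataset.
toy = pd.DataFrame({
    "Ozone": [np.nan, np.nan, 30.0, 12.0],
    "Temp": [np.nan, 67.0, np.nan, 70.0],
    "Wind": [np.nan, np.nan, 9.0, np.nan],
})

any_counts = missing_counts(toy, where=["Ozone", "Temp"], how="any", columns=["Wind"])
all_counts = missing_counts(toy, where=["Ozone", "Temp"], how="all", columns=["Wind"])
print(any_counts["Wind"], all_counts["Wind"])
```

With how="any" the first three rows qualify, while how="all" keeps only the first row, so the Wind counts differ between the two calls.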
The pattern plot below shows how much data is missing for each combination of columns. Blue tiles indicate the presence of data, whereas red tiles indicate missing data.
We see that Ozone has the highest amount of missing data (27 samples), whereas 8 samples are missing both Ozone and Temp.
Note: the raw counts are given on the left.
In [10]:
mf.plot(kind='pattern', norm=False, threshold=0.0)
In [11]:
mf.pattern(columns=['Ozone', 'Temp', 'Solar.R'], norm=False, threshold=0.0)
Out[11]:
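The idea behind a missingness pattern table can be sketched in plain pandas: every distinct row of the boolean mask is one pattern, and we count how many rows share it. This only illustrates the concept on made-up data. How Badfish builds its table, and what `norm` and `threshold` do internally, is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the airquality dataset.
toy = pd.DataFrame({
    "Ozone": [np.nan, np.nan, 30.0, 12.0, np.nan],
    "Temp": [np.nan, 67.0, 72.0, 70.0, 66.0],
})

# Each distinct row of the boolean mask is one missingness pattern;
# groupby(...).size() counts how many rows share each pattern.
mask = toy.isnull()
patterns = mask.groupby(list(mask.columns)).size()
print(patterns)
```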
The corr() method reports, in tabular form, how strongly columns tend to go missing together:
In [12]:
mf.corr(columns=['Ozone', 'Temp', 'Wind'])
Out[12]:
Or perhaps let's look only at how missingness in the other columns correlates with missingness in Ozone:
In [13]:
mf.corr()['Ozone']
Out[13]:
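One way to think about a correlation of missing data is as the ordinary Pearson correlation between 0/1 missingness indicators. The sketch below does this with plain pandas on made-up data. It is an assumption that this matches what mf.corr() computes, so treat it as an illustration of the concept only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the airquality dataset.
toy = pd.DataFrame({
    "Ozone": [np.nan, np.nan, 30.0, 12.0, np.nan, 20.0],
    "Temp": [np.nan, np.nan, 72.0, 70.0, 66.0, 65.0],
    "Wind": [7.0, 8.0, np.nan, np.nan, 9.0, 10.0],
})

# 1 where a cell is missing, 0 where it is filled.
indicators = toy.isnull().astype(int)

# Pearson correlation between the indicator columns. A column with
# no missing values has zero variance and would show up as NaN.
miss_corr = indicators.corr()
print(miss_corr["Ozone"])
```

A positive value means two columns tend to be missing in the same rows. A negative value means one tends to be present when the other is missing.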
One well-known data mining technique is association rule mining. Prior to rule generation, frequent itemsets are generated from the item-item relations in a large dataset according to a given minimum support.
The frequent itemsets of a dataset thus represent strong correlations between different items: each itemset's support estimates the probability that its items occur together in a transaction. If we treat the different columns as items, we can find which columns go missing together and generate (possibly causal) association rules.
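Support and confidence can be worked out by hand on a tiny example. The sketch below uses brute-force enumeration in pure Python rather than a real Apriori implementation, and the rows are invented. It is not how Badfish computes its itemsets and rules, only an illustration of the two quantities.

```python
from itertools import combinations

# Hypothetical rows: each set lists the columns missing in that row.
rows = [
    {"Ozone", "Temp"},
    {"Ozone"},
    {"Ozone", "Temp"},
    set(),
    {"Wind"},
]

def support(itemset):
    """Fraction of rows in which every column of `itemset` is missing."""
    return sum(itemset <= row for row in rows) / len(rows)

min_support = 0.2
columns = sorted(set().union(*rows))

# Frequent itemsets: column combinations whose support meets the threshold.
frequent = {}
for size in (1, 2):
    for combo in combinations(columns, size):
        s = support(set(combo))
        if s >= min_support:
            frequent[combo] = s

# Confidence of the rule "Temp missing -> Ozone missing":
# support of both together divided by support of the antecedent.
confidence = support({"Temp", "Ozone"}) / support({"Temp"})
print(frequent)
print(confidence)
```

Here every row that is missing Temp is also missing Ozone, so the rule has full confidence, while Ozone and Wind never go missing together and the pair is not a frequent itemset.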
In [14]:
mf.frequency_item_set?
In [15]:
itemsets, rules = mf.frequency_item_set(columns=['Ozone', 'Temp', 'Wind'], support=0.01, confidence=0.0)
In [16]:
itemsets
Out[16]:
In [17]:
rules
Out[17]:
In [18]:
mf.cohort(group = ['Ozone'])
Out[18]: